ABSTRACT

This research program continued an investigation of
sensitivity analysis, and its use in the segmentation and
identification of the phonetic units of speech, that was
initiated during the 1982 Summer Faculty Research Program.

The elements of the sensitivity matrix, which express the
relative change in each pole of the speech model in response to
a relative change in each coefficient of the characteristic
equation, were evaluated for an expanded set of data consisting
of six vowels contained in single words spoken in a simple
carrier phrase by five males with differing dialects. The
objectives were to evaluate the sensitivity matrix, to
interpret its changes during the production of the vowels, and
to evaluate inter-speaker variations. It was determined that
the sensitivity analysis (1) serves to segment the vowel
interval, (2) provides a measure of when a vowel is "on
target," and (3) should provide sufficient information to
identify each particular vowel. Based on the results presented,
sensitivity analysis should result in more accurate
segmentation and identification of phonemes and should provide
a practicable framework for incorporation of acoustic-phonetic
variance as well as time and talker normalization.
DESCRIPTION OF RESEARCH PROGRAM


I.  INTRODUCTION:


There are several general approaches used in current
computer-based continuous speech-recognition systems. A
typical system uses a three-step procedure [33]. In the first
step, a word or phrase is divided into time segments or
frames. For each frame, a "best fit" is determined for a
particular parametric representation (speech model). Next, a
statistical decision rule is used to tentatively determine (or
estimate) the phoneme corresponding to each frame. Third, a
set of phonological decision rules is used to combine the
phonemic decisions of the several frames and access lexical
candidates. By contrast, Klatt [12] proposed a system in which
samples of the speech waveform are analyzed to determine a
sequence of spectral representations which are directly
decoded into lexical candidates by a network constructed from
phonemic, phonetic, and phonological rules.

Regardless of the speech processing system used, Klatt [12]
has described eight problem areas that must be overcome.
These are (1) acoustic-phonetic variance, (2) segmentation of
the signal into phonetic units, (3) time normalization, (4)
talker normalization, (5) lexical representations for optimal
search, (6) phonological recoding of words in sentences, (7)
dealing with errors in the initial phonetic representation
during lexical matching, and (8) interpretation of prosodic
cues to lexical items and sentence structure.

This research program was focused on problem 2 and its
interplay with problems 1, 3, and 4. That is, this research
project investigated a novel two-level scheme for more accurate
segmentation of the speech signal into phonetic units. It was
anticipated that this scheme would allow for a more accurate
application of decision rules, and would provide a practicable
framework for incorporation of acoustic-phonetic variance as
well as time and talker normalization. Although the
sensitivity analysis is expressed within the framework of a
three-step speech analysis system, it could also be viewed as
an alternative to the sequence of spectra as utilized by
Klatt. Furthermore, the techniques developed for continuous
speech-recognition systems may also provide important
implementation advantages when compared with current isolated
word or word spotting systems.


Since  the  current  study  was  limited  to  non-nasal  vowels, 
the  initial  parametric  representation  of  each  frame  consisted 
of  the  linear  prediction  coefficients  which  were  used  to 
express  the  coefficients  of  the  following  characteristic 
equation: 


    q(s) = s^n + a_2 s^{n-1} + \cdots + a_n s + a_{n+1}        (1)


The  second  level  of  the  segmentation  depended  on  the 
evaluation  and  interpretation  of  a  sensitivity  matrix  with 
elements  defined  as: 


    S(k,i) = \frac{\partial r_i / r_i}{\partial a_k / a_k}        (2)

This definition easily leads to the following closed-form
expression:

    S(k,i) = \left[ k - 1 - n + \sum_{j=1}^{n} \frac{r_i}{r_i + P_{jk}} \right]^{-1},
             i = 1, \ldots, n;   k = 2, \ldots, n+1        (3)

where r_i is a root of the characteristic equation and P_{jk} is
the j-th root of the characteristic equation when its k-th
coefficient is assigned the value zero. Thus S(k,i) expresses
the relative change in the location of filter pole i
in response to a relative change in coefficient a_k.
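
As a numerical cross-check of the closed-form expression, the
following Python sketch (not part of the original study; NumPy and
all function names are illustrative assumptions) evaluates the
sensitivity matrix in two ways: by implicit differentiation of
q(s) = 0, and by the sum over the roots of the polynomial with its
k-th coefficient set to zero. The code works with the actual roots
p_i of q(s) and z_jk of the modified polynomial, so each term of the
sum is written p_i/(p_i - z_jk); with the poles written as s = -r_i,
as in equation (4), this corresponds to r_i/(r_i + z_jk). Frequencies
are expressed in kHz only to keep the coefficients well scaled, which
leaves the relative sensitivities unchanged. The formant values are
the final /IY/ values listed in Figure 2.

```python
import numpy as np

def char_poly(formants_khz, bandwidths_khz):
    """Coefficients [1, a_2, ..., a_{n+1}] of q(s) with poles at s = -r_i, -r_i*."""
    f = np.asarray(formants_khz, dtype=float)
    bw = np.asarray(bandwidths_khz, dtype=float)
    r = np.pi * bw + 1j * 2.0 * np.pi * f            # equation (5)
    poles = np.concatenate([-r, -np.conj(r)])
    return np.real(np.poly(poles))

def sensitivity_by_derivative(coeffs):
    """S[k-2, i] = (a_k / p_i) dp_i/da_k, with dp_i/da_k = -p_i^(n+1-k) / q'(p_i)."""
    coeffs = np.asarray(coeffs, dtype=float)
    n = len(coeffs) - 1
    poles = np.roots(coeffs)
    dq = np.polyval(np.polyder(coeffs), poles)       # q'(p_i)
    S = np.empty((n, n), dtype=complex)
    for k in range(2, n + 2):
        S[k - 2] = -coeffs[k - 1] * poles ** (n - k) / dq
    return poles, S

def sensitivity_closed_form(coeffs):
    """Same matrix from the roots z_jk of q(s) with its k-th coefficient zeroed."""
    coeffs = np.asarray(coeffs, dtype=float)
    n = len(coeffs) - 1
    poles = np.roots(coeffs)
    S = np.empty((n, n), dtype=complex)
    for k in range(2, n + 2):
        zeroed = coeffs.copy()
        zeroed[k - 1] = 0.0
        z = np.roots(zeroed)                         # the n roots z_jk
        terms = poles[:, None] / (poles[:, None] - z[None, :])
        S[k - 2] = 1.0 / ((k - 1 - n) + terms.sum(axis=1))
    return poles, S

coeffs = char_poly([0.29, 2.07, 2.96], [0.06, 0.20, 0.40])
poles, S_deriv = sensitivity_by_derivative(coeffs)
_, S_closed = sensitivity_closed_form(coeffs)
print("largest disagreement:", np.max(np.abs(S_deriv - S_closed)))
```

Row k-2 of each returned array holds S(k,i) for k = 2, ..., n+1, with
the columns ordered as the poles returned by np.roots.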


II.  SPECIFIC  OBJECTIVES: 

This  research  project  focused  on  a  selected  set  of  six 
vowels  contained  in  single  words,  spoken  in  a  simple  carrier 
phrase,  by  five  males  with  differing  dialects.  The  specific 
objectives  were  to  evaluate  and  interpret  the  changes  in  the 
sensitivity  matrix  that  occur  during  the  production  of  the 
vowels,  to  use  this  broader  data  set  to  test  the  conclusions 
reached  during  the  Summer  Faculty  Research  period,  and  to 
evaluate  inter-speaker  variations. 

It  was  necessary  to  (1)  record,  digitize,  and  frame  the 
data;  (2)  calculate  the  coefficients  of  the  characteristic 
equation  for  each  frame  via  linear  predictive  analysis;  (3) 
evaluate  the  sensitivity  matrix  for  each  frame;  (4)  determine 
if  the  sensitivity  analysis  provided  a  measure  of  the  degree 
to  which  a  vowel  was  "on  target";  (5)  determine  if  the 
sensitivity  matrix  can  be  used  to  identify  the  individual 
phonemes;  and  (6)  evaluate  inter-speaker  variations. 
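
Step (2) can be illustrated with a minimal Python sketch of
autocorrelation-method linear prediction. This is an assumption made
for illustration: the report does not state which linear prediction
method was used, nor how the s-domain characteristic equation was
obtained from the predictor, and the function names below are
hypothetical. The sketch fits a predictor to one frame and converts
the predictor roots to formant frequencies and bandwidths using the
standard z-plane relations.

```python
import numpy as np

def lpc_autocorr(frame, order):
    """Levinson-Durbin solution of the autocorrelation normal equations.
    Returns [1, a_1, ..., a_p] for A(z) = 1 + sum a_i z^-i."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # lags 0..order
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        prev = a.copy()
        a[1:m] = prev[1:m] + k * prev[m - 1:0:-1]
        a[m] = k
        err *= 1.0 - k * k
    return a

def formants_from_lpc(a, fs):
    """Formant frequencies and bandwidths (Hz) from the roots of A(z)."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]        # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    order = np.argsort(freqs)
    return freqs[order], bws[order]

def resonator_impulse_response(f_hz, bw_hz, fs, n_samples):
    """Impulse response of a single two-pole resonator (synthetic test signal)."""
    radius = np.exp(-np.pi * bw_hz / fs)
    theta = 2.0 * np.pi * f_hz / fs
    a1, a2 = -2.0 * radius * np.cos(theta), radius ** 2
    y = np.zeros(n_samples)
    for n in range(n_samples):
        x = 1.0 if n == 0 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = x - a1 * y1 - a2 * y2
    return y

fs = 1.0 / 83e-6                             # 83-microsecond sampling interval
frame = resonator_impulse_response(500.0, 60.0, fs, 2000)
freqs, bws = formants_from_lpc(lpc_autocorr(frame, 2), fs)
print(freqs, bws)
```

For real speech a higher predictor order and a windowed frame would
normally be used; the single synthetic resonator is only a check that
the fit recovers a known frequency and bandwidth.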


III.  GENERAL BACKGROUND:

Automatic speech recognition (ASR) has potential
application to various USAF operational problems. Examples
include voice control of devices and systems, intelligence
data handling, and language identification. A practical ASR
system "must operate on the continuous utterance of any number
of speakers in moderate or even poor noise environments."
Major sources of difficulty include acoustic-phonetic variance
and segmentation of the acoustic signal. The following
discussion reviews the reasons for this difficulty and
expresses the research problem.

Connected-speech-recognition systems have utilized
techniques that represent sound patterns in linguistic units
smaller than words, one such unit being the phoneme [13].
However, the location and specification of the acoustic
characteristics of phonemes has been a central problem in
speech recognition [2]. The problem is not attributable to
mechanical limitations. Instead, it is considered to be the
result of the very nature of human speech production and
perception, for phonemes are not individual sounds, but rather
classes of acoustically different sounds which speakers of a
language have learned to call equivalent.

In contrast to the discrete units of linguistic analysis
(i.e., sentences, phrases, words, morphemes, phonemes), the
acoustic representation of an utterance is semi-continuous.
The problem for those studying speech recognition is to map
semi-continuous acoustic waveforms into discrete linguistic
units [4]. Attempts to accomplish this task have revealed a
number of sources of variation that make this mapping
difficult. The lack of a one-to-one correspondence between
acoustic segments and the linguistic units they represent can
be attributed to (1) coarticulation, (2) allophonic
variation, (3) stress and rate of speech production, (4)
individual speaker differences, and (5) dialectical variation.


Coarticulation can be defined as the influence of one
phoneme upon another. Logically, the mismatch between
phoneme and acoustic representation caused by coarticulation
can be divided into cases in which (1) a phoneme has more than
one acoustic representation, and (2) an acoustic segment can
represent more than one phoneme. An example of the first case
is a vowel, which has nasal cavity resonances and
antiresonances when produced in a nasal consonant context
(e.g., man) but not in others (e.g., pat) [6,7]. Speakers of a
language group these different acoustic events into the same
phoneme class [3], and so must speech recognition systems.

An example of the second kind of mismatch, in which a
unique acoustic signal can correspond to one of several
phonemes, involves the noise burst frequency due to place of
articulation of stop consonants in CV (consonant-vowel)
syllables. It has been demonstrated that a noise burst of a
particular frequency is perceived by a listener as /p/ when
followed by /i/. The same noise burst is perceived as /k/
when the following vowel is /ae/ [8]. As a consequence of
multiple linguistic interpretations of the same acoustic
segment, it is imperative that speech recognition systems be
able to defer decisions about the phonemic identity of an
acoustic event until context can be considered.

Allophonic variation refers to the language-specific
systematic use of different sound segments (phones) to
represent a particular phoneme. For example, the voiceless
stop consonants /p/, /t/, /k/ have three allophones in English
which occur in contexts specified by definable rules [9].
Aspirated voiceless stops (produced with an audible puff of
air at release) are used at the beginning of stressed
syllables (e.g., pea). Unaspirated stops (produced without
the audible release of air) are used (a) at the beginning of
unstressed syllables (e.g., appear), (b) in clusters with /s/
(e.g., speak), and (c) at the ends of words (e.g., keep).
Unreleased stops (produced without opening the vocal tract
after closure) are used (a) when the stop consonant precedes a
homorganic (same place) consonant (e.g., keep me), and (b)
optionally at the end of a phrase (e.g., keep).

Stress and rate of production affect the way in which
speech sounds are articulated and, as a consequence, affect
their acoustic representations. Under conditions of increased
speech rate, the duration of some speech segments is
decreased, and the articulatory targets achieved typically
"fall short." Stressed segments, on the other hand, are
known to be longer and to more closely approximate
articulatory targets [9].

Inter-speaker  differences  in  physical  structure  are 
another  source  of  acoustic-linguistic  mismatch.  The 
differences  in  acoustic  output  that  arise  from  differences  in 
the  physical  structures  themselves  are  predictable  from  the 
acoustic  theory  of  production.1*  In  this  widely  accepted 
view,  the  vocal  tract  is  considered  to  be  a  resonating  tube 
where  movement  of  the  vocal  structures  alters  the  shape  of  the 
tube  and  results  in  the  production  of  different  sounds. 
Important  considerations  are  age  and  sex  of  the  speaker,  since 
both  of  these  parameters  influence  the  size  of  the  larynx  and 
the  size  of  the  vocal  tract,  causing  differences  in 
fundamental  frequency  and  formant  frequencies. 

Dialectical variations include the use of different
phonemic contrasts by speakers of subgroups of a language [9].
These variations are historically derived from different
language backgrounds and geographical isolation of populations
of speakers of the language. The phonemic differences found
are concentrated in the vowels and r-like sounds of English.
As such, dialectical variation is an important consideration
in any automatic recognition scheme designed to identify
vocalic productions.

IV.  APPROACH:

General.

The acoustic theory of speech production considers the
vocal tract as a resonating tube that filters the sound
produced by one of a variety of sources, primarily the many
forms of phonation produced by the larynx [9]. For non-nasal
vowels, an approximate representation of the filter is [14]:

    T(s) = \prod_{i=1}^{n/2} \frac{r_i r_i^*}{(s + r_i)(s + r_i^*)}        (4)

where the constant r_i and its complex conjugate r_i^* are
determined by the values of the i-th formant frequency f_i
and its bandwidth bw_i. That is,

    r_i = \pi bw_i + j 2\pi f_i        (5)
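
The relation between formant parameters and the filter can be made
concrete with a short Python sketch (illustrative only; NumPy and the
function names are assumptions, not the program used in the study).
It forms the constants r_i of equation (5), expands the denominator
of equation (4) into polynomial coefficients, and evaluates T(s). The
single 500 Hz formant with a 60 Hz bandwidth is an arbitrary example
value.

```python
import numpy as np

def pole_constants(formants_hz, bandwidths_hz):
    """r_i = pi*bw_i + j*2*pi*f_i (equation 5), in rad/s."""
    f = np.asarray(formants_hz, dtype=float)
    bw = np.asarray(bandwidths_hz, dtype=float)
    return np.pi * bw + 1j * 2.0 * np.pi * f

def denominator_coefficients(r):
    """Expand prod (s + r_i)(s + r_i*) into [1, a_2, ..., a_{n+1}]."""
    poles = np.concatenate([-r, -np.conj(r)])
    return np.real(np.poly(poles))

def transfer_function(s, r):
    """Evaluate T(s) of equation (4) at a complex frequency s."""
    numerator = np.prod(r * np.conj(r))
    denominator = np.prod((s + r) * (s + np.conj(r)))
    return numerator / denominator

r = pole_constants([500.0], [60.0])
coeffs = denominator_coefficients(r)
print(coeffs)                       # [1, 2*pi*bw, |r|^2] for a single formant
print(transfer_function(0.0, r))    # unity gain at s = 0
```

For a single formant the expansion reduces to s^2 + 2 pi bw s + |r|^2,
which shows that the second coefficient carries the bandwidth and the
third carries mainly the formant frequency.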

Kenneth Stevens has used simple acoustic tube models to
investigate the interrelationships between the shape of the
vocal tract and the various formant frequency and bandwidth
changes. As a result, Stevens proposed that there is a
quantal nature to speech. That is, "there are certain
articulatory conditions for which a small change in some
parameter describing the articulation gives rise to an
apparently large change in the acoustic characteristics of the
output; there are other conditions for which substantial
perturbations of certain aspects of the articulation produce
negligible changes in the characteristics of the acoustic
signal." [18]

For the high front vowel /i/, Stevens' acoustic analysis
predicts a low first formant and that formants 2 and 3 should
be close together. Furthermore, he concluded that for the low
and high back vowels /a, u/, formants 1 and 2 should be
close. In the following discussion, the sensitivity matrix is
proposed as a method to locate and characterize these kinds of
formant interrelations.

Expanding the denominator of T(s) gives the
characteristic equation:

    q(s) = s^n + a_2 s^{n-1} + \cdots + a_n s + a_{n+1}        (6)

If any coefficient a_k is varied, then each constant r_i must
also change [15]. The sensitivity matrix, defined in equation
(2), is a relative measure for the extent of these changes.
As illustrated in Figure 1 for the case where n equals six, it
is possible to vary each coefficient, one at a time, from zero
to infinity and make a sketch of the corresponding roots
(root-locus) of the characteristic equation [16]. These root
changes reflect, as described by equation (5), the changes in
formant frequencies and formant bandwidths.

Note  that  the  elements  S(i,j)  of  the  sensitivity  matrix 
are  proportional  to  the  slopes  of  root-locus  branches  at  the 
points  corresponding  to  the  particular  coefficient  values. 

Thus  the  elements  of  the  sensitivity  matrix  are  complex 
quantities  which  express  the  magnitude  and  direction  of  root 
changes  due  to  coefficient  changes.  Because  S(i,j)  is 
normalized,  a  direction  or  phase  of  0  means  that  the  root  is 
moving  in  the  direction  of  the  vector  from  the  s-plane  origin 
to  the  root. 
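
The meaning of the phase can be illustrated with a first-order
Python sketch (an illustration under stated assumptions, not the
study's program; the function name is hypothetical). A pole at
s = -r_i, with r_i given by equation (5), is displaced by
approximately p S (delta a / a) for a small relative coefficient
change, and the displacement is then read back as a change in formant
frequency and bandwidth. The sensitivity magnitude of 0.2 and the 1%
coefficient change are example values only.

```python
import numpy as np

def formant_shift(f_hz, bw_hz, sensitivity, rel_coeff_change):
    """First-order change in formant frequency and bandwidth (Hz).

    The pole is p = -pi*bw + j*2*pi*f; its displacement is
    dp = p * S * (da/a), where S is the complex sensitivity element.
    """
    p = -np.pi * bw_hz + 1j * 2.0 * np.pi * f_hz
    dp = p * sensitivity * rel_coeff_change
    d_freq = dp.imag / (2.0 * np.pi)
    d_bw = -dp.real / np.pi
    return d_freq, d_bw

# Phase 0: the pole moves along the line from the origin through the pole,
# so frequency and bandwidth scale together and the frequency change dominates.
radial = formant_shift(2070.0, 200.0, 0.2, 0.01)

# Phase 90 degrees: the pole moves perpendicular to that line,
# so the bandwidth changes while the frequency is nearly unchanged.
perpendicular = formant_shift(2070.0, 200.0, 0.2j, 0.01)

print(radial, perpendicular)
```

Under these assumptions a phase of 180 or 270 degrees simply reverses
the sign of the corresponding change.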

For example, at the points labeled 1 on the curves for a_2
shown in Figure 1, increasing a_2 results in small changes in
the three formant frequencies, but significantly changes the
bandwidths. Expressed in terms of sensitivity elements,
S(2,1) and S(2,3) have a phase of essentially 90 degrees,
whereas S(2,2) has a phase of essentially 270 degrees. By
contrast, at the points labeled 1 on the curves for a_3 shown
in Figure 1, increasing a_3 results in small changes in
bandwidth, but significant changes in formant frequencies.
Expressed in terms of sensitivity elements, S(3,1) and S(3,3)
have a phase of essentially 0 degrees, whereas S(3,2) has a
phase of essentially 180 degrees.

Previous Work.

The investigations initiated during the 1982 Summer
Faculty Research Program have utilized vowel data reported by
Dennis Klatt [14] and shown in Figure 2. These data are for his
voice and are a composite obtained from the analysis of many
consonant-vowel productions. The parameters listed are the
initial and final values of the first three formants and their
bandwidths. Each vowel is represented by Klatt with a
two-letter code which will be used throughout this paper. The
correspondence between this code and a more standard phonetic
transcription is seen in Figure 2. A computer program was
written to

(1)  calculate, from the given formant frequencies and
     bandwidths, the coefficients of the characteristic
     equation,

(2)  vary each coefficient a_k by ±25% or ±50% from its
     initial or nominal value,

(3)  calculate the corresponding elements of the
     sensitivity matrix, and

(4)  make plots of the magnitude and phase of each
     sensitivity element as each coefficient a_k is
     varied.

All the vowel data of Figure 2 have been processed and
the results are briefly summarized in the following
discussion. The final report contains a more complete
discussion [17].
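
The four program steps listed above can be sketched in Python as
follows. This is a reconstruction for illustration, not the original
program: NumPy, the function names, and the use of kHz units (chosen
only to keep the coefficients well scaled) are assumptions, and the
plotting step is replaced by returning the magnitude and phase
arrays. The formant values are the final /IY/ values of Figure 2.

```python
import numpy as np

def char_poly(formants_khz, bandwidths_khz):
    """Step 1: coefficients [1, a_2, ..., a_{n+1}] from formants and bandwidths."""
    f = np.asarray(formants_khz, dtype=float)
    bw = np.asarray(bandwidths_khz, dtype=float)
    r = np.pi * bw + 1j * 2.0 * np.pi * f
    return np.real(np.poly(np.concatenate([-r, -np.conj(r)])))

def formant_sensitivities(coeffs, k):
    """Step 3: S(k, i) for the upper-half-plane poles, ordered by frequency."""
    n = len(coeffs) - 1
    poles = np.roots(coeffs)
    upper = poles[np.argsort(poles.imag)[-(n // 2):]]
    dq = np.polyval(np.polyder(coeffs), upper)       # q'(p_i)
    return -coeffs[k - 1] * upper ** (n - k) / dq

def sweep(coeffs, k, span=0.25, steps=11):
    """Steps 2 and 4: vary a_k by +/- span and tabulate magnitude and phase."""
    scales = np.linspace(1.0 - span, 1.0 + span, steps)
    mags = np.empty((steps, (len(coeffs) - 1) // 2))
    phases = np.empty_like(mags)
    for row, scale in enumerate(scales):
        varied = coeffs.copy()
        varied[k - 1] *= scale
        S = formant_sensitivities(varied, k)
        mags[row] = np.abs(S)
        phases[row] = np.degrees(np.angle(S)) % 360.0
    return scales, mags, phases

coeffs = char_poly([0.29, 2.07, 2.96], [0.06, 0.20, 0.40])
scales, mags, phases = sweep(coeffs, k=2)
for scale, m, p in zip(scales, mags, phases):
    print(f"{scale:4.2f} a(2)  |S| = {np.round(m, 3)}  phase = {np.round(p, 1)}")
```

Each row of the output corresponds to one point on the magnitude and
phase curves for formants 1, 2, and 3 as the chosen coefficient is
scaled about its nominal value.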

D. H. Klatt

Table 2. Parameter values for the synthesis of selected vowels.
If two values are given, the vowel is diphthongized or has a
schwa-like offglide in the speech of the author. The amplitude of
voicing, AV, and fundamental frequency, F0, must also be given
contours appropriate for an isolated vowel.

Vowel        F1     F2     F3     B1     B2     B3

IY  i       310   2020   2960     --    200    400
            290   2070   2960     60    200    400
IH  ɪ        --   1600   2570     50    100    140
            470   1600   2600     50    100     --
EY  e        --   1720   2520     70    100    200
            330   2020   2600     55    100    200
EH  ɛ        --   1660   2500     60     90    200
            620   1530   2530     60     90    200
AE  æ        --     --     --     --     --     --
AA  ɑ       700   1220   2600    130     70    160
UH  ʊ        --     --   2350     80    100     80
             --     --   2390     80    100     80
UW  u        --     --     --     65     --     --
             --     --   2200     65     --     --
ER  ɝ       470     --     --     --     60    110
            420     --   1540    100     --     --

(Entries shown as -- and the rows for the remaining vowels are
not legible in this copy.)

FIGURE 2

FORMANT FREQUENCIES AND BANDWIDTHS OF SELECTED VOWELS
(reproduced from reference 14, page 291)

Figures 3 and 4 show the magnitude and phase plots for
S(2,*) for the high front vowel /IY/ as coefficient a(2) is
varied ±25% from its nominal value. In all plots, formants
1,  2,  and  3  correspond  to  plotting  symbols  *,A,  and  a  . 

Thus,  Figure  3  shows  that  formant  3  is  the  most  sensitive  and 
formant  1  is  the  least  sensitive  to  changes  in  coefficient 
a(2).  But  even  a  sensitivity  of  .2  is  small.  Referring  to 
Figure  4,  the  phase  curves  of  formants  1  and  3  are  essentially 
90  degrees,  and  the  phase  of  formant  2  is  essentially  270 
degrees.  Thus,  an  increase  in  coefficient  a(2)  increases  the 
bandwidths  of  formants  1  and  3,  decreases  the  bandwidth  of 
formant  2,  and  essentially  does  not  change  any  of  the  formant 
frequencies.  This  type  of  influence  was  found  to  hold  for  all 
vowels  and  also  for  coefficients  a(4)  and  a(6). 

For  the  same  phoneme  /IY/,  Figures  5  through  8  show  the 
magnitude  and  phase  plots  for  sensitivity  elements  S(3,*)  and 
S(5,*)  as  coefficients  a(3)  and  a(5)  are  varied.  Figures  6 
and 8 show that under the nominal conditions of 1.0 a(3) and
1.0  a(5),  the  phase  associated  with  each  formant  is 
essentially  0  or  180  degrees.  A  phase  of  0  means  that  the 
root  is  moving  in  the  direction  of  the  vector  from  the  s-plane 
origin  to  the  root. 

Referring  to  Figure  6,  formant  frequencies  1  and  3  are 
essentially  increasing,  formant  frequency  2  essentially 
decreasing,  and  only  small  changes  are  occurring  in  the 
formant bandwidths as coefficient a(3) increases. As shown in
Figure 8, the same kinds of changes occur with a decrease in
coefficient  a(5).  This  kind  of  influence  was  also  found  to 
hold  for  coefficient  a(7)  and  for  all  the  vowels. 
The observation that the phase relations described for
odd and even numbered coefficients hold for all vowels
suggests a categorical indicator for non-vowel waveforms.

Further examination of the phase curves of Figures 6 and
8 shows that coefficient a(3) is essentially as low, and
coefficient a(5) is essentially as high, as possible without
formants 2 and 3 moving "around the corners" in their
associated root-locus branches. That is, with nominal
coefficient values, formants 2 and 3 are essentially as close
together as possible without major changes in their
bandwidths. This type of relationship between formants 2 and
3 was found to hold for the mid-vowel /ER/ and all the front
vowels /IY, IH, EY, EH, AE/. Thus there is a clear
categorical indicator for this group of vowels.

In  contrast,  the  nominal  value  of  coefficient  a(3)  is 
intermediate  between  the  root-locus  corners  of  formants  2  and 
3  and  the  root-locus  corners  of  formants  2  and  1  for  the  low 
back  vowel  /AH/.  Also,  Figures  9  and  10  show  that  S(5,*)  is 
now  quite  different  in  that  the  nominal  value  of  coefficient 
a(5)  is  essentially  as  low  as  possible  without  formants  1  and 
2  moving  around  their  root-locus  corners.  Again,  these  kinds 
of  relationships  hold  for  all  the  back  vowels  /AA,  AO,  AH,  OW, 
UH, UW, AY, AW, OY/ and serve as a clear categorical
indicator.

Changes in the sensitivity elements also reflect the
changes that occur in moving from a high front vowel like /IY/
to a low front vowel like /AE/. Sensitivity elements S(3,1)
and S(5,1) are larger for vowel /AE/ since the root-locus
corners for formants 2 and 3 are closer to the root-locus
corners for formants 1 and 2. The changes in sensitivity
element S(3,1) in moving from the highest to the lowest front
vowel are shown in Figure 11. To mimic the effect of noise,
the figure also shows how the sensitivity element S(3,1)
changes due to a ±5% change in coefficient a(5).

These  results  suggest  that  the  sensitivity  elements  may 
be  sufficient  to  identify  the  particular  front  vowel.  If  the 
starting  conditions  of  some  front  vowels  are  similar,  such  as 
the  long/short  pair  /EY,  EH/,  further  specificity  may  be 
obtained  by  observing  the  subsequent  changes  in  the 
sensitivity  elements.  For  example,  the  Klatt  data  is  somewhat 
diphthongized and S(3,1) for phoneme /EY/ decreases from .067
to  .023,  whereas  for  phoneme  /EH/  it  increases  from  .087  to 
.150. 


The  changes  that  occur  among  the  group  of  back  vowels  are 
reflected  by  the  relative  location  of  the  root-locus  corners 
for  formants  2  and  3  and  the  root-locus  corners  for  formants  1 
and  2.  Values  of  the  sensitivity  matrix  are  a  measure  of 
these  locations  and  may  be  sufficient  to  identify  sub-groups 
as  well  as  the  particular  back  vowels.  Illustrated  by  Figure 
12  is  the  low  back  vowel  sub-group  /AA,  AY,  AW,  AH/. 

Subsequent  changes  in  the  diphthongs  /AY,  AW/,  (AfO)  may 
again  provide  a  greater  specificity  among  the  elements  of  this 
sub-group.  The  remaining  back  vowels  are  shown  in  Figure  13 
where  /OY/  (A)  is  another  diphthong. 

For  the  high  front  vowel  /IY/,  Stevens'  acoustic  analysis 
predicted  a  low  first  formant  and  that  formants  2  and  3  should 
be  close  together.  The  sensitivity  analysis  of  Klatt's  data 
corroborates  and  extends  Stevens'  results  by  showing  that  this 
condition  holds  for  all  front  vowels  and  for  the  mid-vowel 
/ER/.  Stevens  also  considered  the  low  and  high  back  vowels 
/AA,  UW/  and  in  each  case  concluded  that  formants  1  and  2 
should be close. Again, the sensitivity analysis of Klatt's
data corroborates and extends Stevens' results by showing that
this condition holds for all back vowels.

FIGURE 13 - KLATT DATA /AO, OY, OW, UH, UW/

In  summary,  the  sensitivity  matrix  was  evaluated  for  the 
initial  and  final  representations  of  the  fifteen  vowels  in 
Klatt’s  data  set.  It  was  found  that  (1)  the  sensitivity 
matrix  does  provide  a  measure  of  the  degree  to  which  a  sound 
is  "on  target"  by  locating  the  sound  relative  to  the 
root-locus  corners  of  formants  2  and  3  and  those  for  formants 
2  and  1;  (2)  that  formants  2  and  3  being  close  to  their  root- 
locus  corners  provides  a  categorical  indicator  for  the  group 
of  front  vowels;  (3)  that  formants  2  and  1  being  close  to 
their  root-locus  corners  provides  a  categorical  indicator  for 
the  group  of  back  vowels;  and  (4)  that  particular  elements  of 
the  sensitivity  matrix  may  provide  sufficient  information  to 
identify  the  particular  vowel. 

V.  PRESENT  WORK: 

Overview. 

The positive results obtained with the Klatt data
demonstrated that further investigations with the sensitivity
analysis were warranted. A sequence of studies utilizing
"real" speech obtained from several male and female speakers
should clarify its usefulness. Furthermore, any particular
method of segmentation and identification of phonemes should
be challenged by speech material which presents, in both a
controlled and naturalistic manner, as many of the factors
known to cause acoustic-phonetic variations as possible. The
entire set of English vowels should be used in conjunction
with a number of consonants that sample coarticulatory
variations. These consonants should include (1) differing
manners - stops, fricatives, approximants, liquid vs. glide
vs. nasal contrast, (2) different voicing, and (3) differing
place - labial vs. velar. Stress, tempo, and word
position-structure should also be included.

The  sequence  of  studies  could  be  described  as  follows. 
Task  I  should  evaluate  the  changes  in  the  sensitivity  matrix 
that  occur  during  the  production  of  the  vowels  in  the  words, 
test  the  conclusions  reached  with  Klatt's  data  set,  and 
evaluate  inter-speaker  variations.  Task  II  should  evaluate 
and  interpret  changes  in  the  sensitivity  matrix  for  the  vowels 
due  to  coarticulation.  Task  III  should  evaluate  and  interpret 
the  sensitivity  matrix  for  the  differing  initial  consonants. 
Finally,  Task  IV  should  use  the  above  results  to  build  a 
reference  library,  and  should  evaluate  the  efficacy  of  the 
sensitivity  matrix  in  terms  of  the  accuracy  of  the  resulting 
phonetic  representation  of  unknown  speech. 

It was considered premature and unrealistic to include
all these factors and studies in the current research plan.
Instead, only the three stop consonants /b, d, g/ were used in
single words with the six vowels /i, ɛ, æ, ɔ, ʌ, u/. The
three consonants were selected because they have the same
manner and the same voicing, while their differing place
should induce substantial coarticulatory variations. Of the
six vowels, three are front and three are back. They were
selected because /i/ is "far away" from /ɛ, æ/, whereas /ɛ/
and /æ/ are "close" and "difficult" to distinguish using
current methods and techniques. The same kind of relationship
holds for /u/ and the pair /ɔ, ʌ/. The words shown in
Table 1 were chosen to express these sets of consonants and
vowels.

TABLE 1.  TEST WORDS USED IN CARRIER PHRASE "SAY (WORD) AGAIN."

                      CONSONANT
VOWEL        /b/        /d/        /g/

/i/          bead       deed       geese
/ɛ/          bed        dead       guess
/æ/          bad        dad        gas
/ʌ/          bud        dud        Gus
/ɔ/          baud       dawdle     gauze
/u/          booed      dude       goose

Using this selected set of data, the studies were limited
to those of Task I. It was anticipated that the elements of
the sensitivity matrix would change during the production of a
test word. These changes in the sensitivity matrix were
analyzed, as with Klatt's data set, to determine whether (1)
there were general properties that hold for all vowels and
thereby provide a measure of the degree to which a vowel was
"on target," (2) there were properties that provided
categorical indicators for particular subgroups of the vowels,
and (3) the matrix provided sufficient information to identify
each vowel. Also, multiple repetitions and multiple speakers
allowed modest statistical assessment of intra- and
inter-speaker variations.

Methods.

The subjects were five male speakers with differing
fundamental frequency and dialect. Dialects chosen were
representative of Vermont, Pennsylvania, Virginia, Rhode
Island, and Michigan. Speakers with these dialects were
readily available and sampled those described by the
Linguaphone Institute, American Dialect Series. Each speaker
made, in fully randomized order, three repetitions of each
word in the carrier phrase "Say (word) again" on each of four
days. Thus the final corpus of utterances consisted of 5
speakers × 12 repetitions × 6 vowels × 3 consonants.

All master recordings were made in an Industrial
Acoustics sound room using a Nakamichi 550 portable cassette
recorder. Subjects were instructed to use equal effort to
produce the samples while speaking in a normal conversational
tempo and voice into a head-band held Teledyne EC-101 electret
microphone. Subjects were instructed to produce a word which
was fully pronounced; that is, no casual speech alternations
of word structure were accepted. VU levels were monitored as
a check on speaking level. Two expert phoneticians monitored
speech productions and rejected any sample which was not
jointly recognized "live" as an adequate production of the
word. They transcribed each vowel production to determine its
perceptual quality vis-à-vis a traditional phonetic vowel
quadrangle.

During playback, the master recordings were bandlimited
to 4.8 kHz and were digitized at a sampling interval of 83
microseconds (a sampling rate of about 12 kHz) using the
12-bit A/D converter on the PDP 11-34 computer. Using a
waveform editing program, a particular carrier phrase was
displayed on the AED 512 color graphics terminal, and the test
word was excised for storage and analysis. Each test word was
divided into successive frames. For each frame, the formant
frequencies and bandwidths were calculated via linear
predictive analysis [19-23]. Using only the first three
formant frequencies and/or bandwidths for each frame, the
elements of the sensitivity matrix were calculated. As an aid
for interpretation of results, the color graphics terminal was
used to plot the test word waveform along with waveforms of
the corresponding elements of the sensitivity matrix. The
following discussion develops the signal analysis problems and
methods in greater detail.

In linear predictive analysis, an all-pole model of a
signal is determined by predicting each signal sample as a
linear combination of some number of previous signal samples.
Programs were implemented on the laboratory computer, a PDP
11-34, for both the covariance and autocorrelation methods of
determining the linear predictive coefficients [24]. After
testing each method with typical sets of speech data, it was
decided that further studies should utilize the covariance
method since it provided less frame-to-frame variation and a
smaller prediction error than did the autocorrelation
method [25].
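
The covariance method described above can be illustrated with a minimal sketch. It assumes the predictor form x[n] ≈ Σ c[k]·x[n-k] and solves the normal equations by plain Gaussian elimination; the function name `covariance_lpc`, the sign convention, and the test signal are illustrative assumptions and are not taken from the laboratory programs.

```python
def covariance_lpc(x, order):
    """Least-squares predictor c[1..order] with x[n] ~ sum_k c[k] * x[n-k].

    Covariance method: the squared error is summed only over
    n = order..N-1, so no samples outside the frame are assumed zero.
    """
    n_samples = len(x)
    # phi[i][k] = sum_n x[n-i] * x[n-k], for i, k in 0..order
    phi = [[sum(x[n - i] * x[n - k] for n in range(order, n_samples))
            for k in range(order + 1)] for i in range(order + 1)]
    # Normal equations: sum_k c[k] * phi[i][k] = phi[i][0], i = 1..order,
    # stored as an augmented matrix.
    a = [[phi[i][k] for k in range(1, order + 1)] + [phi[i][0]]
         for i in range(1, order + 1)]
    # Forward elimination with partial pivoting.
    for col in range(order):
        pivot = max(range(col, order), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, order):
            factor = a[row][col] / a[col][col]
            for j in range(col, order + 1):
                a[row][j] -= factor * a[col][j]
    # Back substitution.
    coeffs = [0.0] * order
    for row in range(order - 1, -1, -1):
        tail = sum(a[row][j] * coeffs[j] for j in range(row + 1, order))
        coeffs[row] = (a[row][order] - tail) / a[row][row]
    return coeffs


# Hypothetical test signal: impulse response of a known two-pole resonator.
x = [1.0, 1.3]
for n in range(2, 60):
    x.append(1.3 * x[n - 1] - 0.8 * x[n - 2])
print(covariance_lpc(x, 2))  # close to [1.3, -0.8]
```

Because the error is summed only over samples whose full prediction history lies inside the frame, a signal that exactly obeys an all-pole recursion is recovered with essentially zero prediction error, which is in keeping with the smaller prediction error reported for the covariance method.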

These initial studies with typical speech data were also
used to decide on other details of the signal analysis
methods. By comparing the results obtained when the predictor
order was varied from 8 to 22, it was found that the predictor
order should be at least 16. The resolution provided by this
high predictor order was necessary when, as in high back
vowels like /UW/, two formant frequencies of unequal bandwidth
were close together. Use of high-pass filtering, where the
corner frequency is below the first formant frequency, and/or
use of pre-emphasis would allow a smaller predictor order [26].
Because of their added complexity and their influence on the
location of the spectral peak for the first formant frequency,
it was decided not to use high-pass filtering or pre-emphasis.

Contiguous and fixed frame lengths of 256 speech signal
samples were utilized in the initial studies. Since the
frames were not pitch synchronous, the corresponding
sensitivity analysis showed some cyclic frame-to-frame
variations. Thus, it was decided to adopt a quasi-synchronous
framing method where the frame length was selected as twice
the estimated pitch period and each frame began in the middle
of a pitch period. Pitch period estimates were made using a
simplified filter tracking algorithm (SIFT) by Markel [27] and
were updated each 512 speech samples. In this algorithm, the
speech waveform was down-sampled, low-pass filtered, and
represented by a fourth order all-pole model. The inverse
filter was formed by inverting the transfer function of the
all-pole model. The pitch period was estimated from the peak
in the autocorrelation sequence which was calculated with the
signal that results from passing the processed signal through
the corresponding inverse filter.
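
The last step of the pitch estimate, picking the peak of the autocorrelation sequence, can be sketched as follows. This is only an illustration of that final step under stated assumptions: the down-sampling, low-pass filtering, and inverse filtering are not reproduced, and the function name, lag limits, and pulse-train input are hypothetical.

```python
def pitch_period_from_autocorrelation(residual, min_lag, max_lag):
    """Return the lag (in samples) with the largest autocorrelation value."""
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(residual[n] * residual[n - lag]
                    for n in range(lag, len(residual)))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


SAMPLE_INTERVAL_US = 83  # sampling interval stated in the text

# Hypothetical inverse-filter output: one pulse every 100 samples.
residual = [1.0 if n % 100 == 0 else 0.0 for n in range(1000)]
period = pitch_period_from_autocorrelation(residual, 30, 180)
frame_length = 2 * period  # frame length of twice the estimated pitch period
print(period, frame_length, period * SAMPLE_INTERVAL_US / 1000.0, "ms")
```

The lag limits bound the search to plausible pitch periods; without them the zero-lag term would always win.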

After framing the speech data, as described above, and
calculating the coefficients of the all-pole model using the
covariance method, it was necessary to calculate the formant
frequencies and their bandwidths. Initial studies utilized
the roots of the denominator of the transfer function [20].
Any real roots, or any complex-conjugate roots with "wide"
bandwidths, were discarded since the resonant peaks were
considered to be represented by complex-conjugate poles with
"narrow" bandwidths. This method was abandoned because of
difficulty in judging narrow versus wide bandwidths. Instead,
the all-pole model was used to calculate the signal spectrum,
and a peak-picking algorithm was implemented [26]. It was
necessary to develop some decision logic since multiple
spectral peaks were sometimes found to occur near or below the
first formant frequency. This was done by placing upper and
lower bounds on the first formant frequency and associating it
with the spectral peak of the smallest bandwidth that occurred
between these bounds.
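
As a rough illustration of spectrum-based peak picking, the sketch below evaluates the magnitude of an all-pole model on a frequency grid and returns its local maxima. The grid size, the function names, and the two-resonance test model are assumptions made for the example; the bandwidth-based decision logic for the first formant is not reproduced.

```python
import cmath
import math

SAMPLE_RATE_HZ = 1.0e6 / 83.0  # implied by the 83-microsecond sampling interval


def all_pole_magnitude(predictor, n_points=512):
    """|1/A| from 0 Hz to half the sample rate, A(z) = 1 - sum_k c[k] z**-k."""
    freqs, mags = [], []
    for m in range(n_points + 1):
        f = 0.5 * SAMPLE_RATE_HZ * m / n_points
        z_inv = cmath.exp(-2j * math.pi * f / SAMPLE_RATE_HZ)
        a_val = 1.0 - sum(c * z_inv ** (k + 1) for k, c in enumerate(predictor))
        freqs.append(f)
        mags.append(1.0 / abs(a_val))
    return freqs, mags


def pick_peaks(freqs, mags):
    """Frequencies of interior local maxima of the magnitude spectrum."""
    return [freqs[i] for i in range(1, len(mags) - 1)
            if mags[i - 1] < mags[i] >= mags[i + 1]]


def resonator(freq_hz, radius):
    """Second-order factor 1 - 2 r cos(theta) z**-1 + r**2 z**-2."""
    theta = 2.0 * math.pi * freq_hz / SAMPLE_RATE_HZ
    return [1.0, -2.0 * radius * math.cos(theta), radius * radius]


def poly_mul(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


# Hypothetical fourth-order model with resonances near 500 Hz and 1500 Hz.
a_poly = poly_mul(resonator(500.0, 0.97), resonator(1500.0, 0.97))
predictor = [-c for c in a_poly[1:]]
freqs, mags = all_pole_magnitude(predictor)
peaks = pick_peaks(freqs, mags)
print([round(f) for f in peaks])
```

Working from the spectrum avoids having to classify roots as "narrow" or "wide", at the cost of a frequency resolution limited by the grid spacing.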
Once the first three formant frequencies and bandwidths
were determined for each frame, the elements of the
sensitivity matrix were calculated using equations (5) and
(3). It was anticipated that the elements of the sensitivity
matrix would change during the production of a test word and
that the angle of the sensitivity elements would provide a
measure of the degree to which a vowel was "on target" by
characterizing the root locations relative to the root-locus
corners. Results from initial studies tended to support this
hypothesis but they also showed frame-to-frame variations in
sensitivity angle that resulted from corresponding changes in
the formant bandwidth estimates. Regardless of whether these
variations occurred because the signal analysis methods did
not provide accurate formant bandwidth estimates and/or
because the speech process does not accurately control energy
loss mechanisms, it was decided to approximate the system by a
lossless model.

The hypothesis of a lossless model implies that knowledge
of the formant frequency bandwidths is not essential for the
recognition of the vowels in the test set of words. Under
these conditions, equations (5) and (6) become:

\[ r_i = j\,2\pi f_i , \qquad i = 1, \ldots, \tfrac{n}{2} \tag{7} \]

\[ q(s) = s^{n} + a_3 s^{n-2} + \cdots + a_{n-1} s^{2} + a_{n+1} \tag{8} \]

where n is even.

In order for the angle of the sensitivity elements to be
expressed in rectangular cartesian form, the definition of
equation (2) was changed as follows:

\[ S(k,i) = \frac{a_k}{r_i}\,\frac{d r_i}{d a_k} \tag{9} \]

This definition easily leads to the following closed-form
expression:

\[ S(k,i) = \frac{a_k\,(-r_i)^{\,n-k+1}}{r_i \left. \dfrac{dq(s)}{ds} \right|_{s=-r_i}} \tag{10} \]

Equation (10) was used in lieu of equation (3) because it was
computationally more efficient.
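
A minimal numerical sketch of the closed-form expression (10) for the lossless case is given below. It assumes that the coefficients are indexed so that a_k multiplies s^(n-k+1) with a_1 = 1, and that the roots of q(s) lie at s = -r_i with r_i = j2πf_i; the formant values are hypothetical /IY/-like numbers chosen only for illustration, not data from this study.

```python
import math


def poly_mul(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_eval(coeffs, s):
    """Horner evaluation; coeffs are in descending powers of s."""
    acc = 0.0
    for a in coeffs:
        acc = acc * s + a
    return acc


def poly_derivative(coeffs):
    degree = len(coeffs) - 1
    return [a * (degree - i) for i, a in enumerate(coeffs[:-1])]


def lossless_polynomial(formants_hz):
    """Coefficients a_1..a_{n+1} of q(s) = prod_i (s**2 + (2*pi*f_i)**2)."""
    coeffs = [1.0]
    for f in formants_hz:
        w = 2.0 * math.pi * f
        coeffs = poly_mul(coeffs, [1.0, 0.0, w * w])
    return coeffs


def sensitivity(coeffs, k, r_i):
    """Closed-form S(k,i) of equation (10); coeffs[k-1] holds a_k."""
    n = len(coeffs) - 1
    dq = poly_eval(poly_derivative(coeffs), -r_i)
    return coeffs[k - 1] * (-r_i) ** (n - k + 1) / (r_i * dq)


formants_hz = [270.0, 2290.0, 3010.0]  # hypothetical /IY/-like formants
coeffs = lossless_polynomial(formants_hz)
row3 = [sensitivity(coeffs, 3, 2j * math.pi * f).real for f in formants_hz]
print(row3)  # S(3,1), S(3,2), S(3,3)
```

For these illustrative formants S(3,1) is small and S(3,3) is near 2, and the three values sum to essentially zero.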

Summing the sensitivity elements across any row of the
matrix easily leads to:

\[ \sum_{i=1}^{n} S(k,i) = \begin{cases} 0, & k = 2, \ldots, n \\ 1, & k = n+1 \end{cases} \]

Combining this constraint, on the elements in any row, with
the fact that the sensitivity elements of complex-conjugate
roots $r_i$ are themselves complex-conjugates, shows that only
n/2 - 1 of the elements in each row are independent.
Furthermore, equation (10) shows that the elements in each
column are simply related by the factor $a_k (-r_i)^{-k}$.
Thus, n/2 - 1 of the elements in any row of the sensitivity
matrix should be sufficient to characterize the root
sensitivity patterns. For this research project, n was equal
to 6 since only the first three formants were considered.
Thus, attention was focused on the two elements S(3,1) and
S(3,3) which expressed the sensitivity of formant frequencies
1 and 3 to changes in coefficient $a_3$.


For this lossless case, the corresponding root-locus of
Figure 1 is simplified as shown in Figure 14. Since only the
portion of the root-locus on the jω axis is consistent with
the lossless condition, the points marked CR2 and CR1 portray
the minimum, $a_{3\,\min}$, and maximum, $a_{3\,\max}$, values
permitted for coefficient $a_3$. At corner location CR2,
formants 2 and 3 are equal and attain their corresponding
maximum and minimum values. The minimum value for the
frequency of formant 1 occurs with $a_{3\,\min}$. At the other
corner location CR1, formants 1 and 2 are equal and attain
their corresponding maximum and minimum values. The maximum
value for the frequency of formant 3 occurs with
$a_{3\,\max}$. A direct calculation was made for these values
of coefficient $a_3$ and the corresponding frequencies.


[Figure 14: sketch along the jω axis showing the formant
branches, corner CR2 for $a_{3\,\min}$, and corner CR1 for
$a_{3\,\max}$.]

FIGURE 14 - ROOT-LOCUS SKETCH FOR LOSSLESS MODEL
It was anticipated that the degree to which a vowel was
"on target" could be described by the root locations relative
to the root-locus corners CR2 and CR1. Two measures of this
property were defined:

\[ \text{Corner Ratio 1} = \frac{a_3}{a_{3\,\max}} \tag{11} \]

\[ \text{Corner Ratio 2} = \frac{a_{3\,\min}}{a_3} \tag{12} \]

If formants 2 and 3 being close was characteristic of front
vowels, then Corner Ratio 2 should be large. Likewise, if
formants 1 and 2 being close was characteristic of back
vowels, then Corner Ratio 1 should be large.
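
One way the corner values and the two Corner Ratios might be computed directly from three formant frequencies is sketched below. It assumes that the remaining coefficients of the lossless polynomial (the s² coefficient and the constant term) are held fixed while the s⁴ coefficient $a_3$ is varied, and it locates the corners as the values of $a_3$ at which two formants coincide. The function name and the example formant values are hypothetical, and this is not claimed to be the calculation used in the study.

```python
import math


def corner_ratios(f1, f2, f3):
    """Return (Corner Ratio 1, Corner Ratio 2) for three formant frequencies.

    Works with x = f**2, so that x**3 - a3*x**2 + a5*x - a7 has roots
    x_i = f_i**2. The ratios do not depend on the frequency unit.
    """
    x1, x2, x3 = f1 * f1, f2 * f2, f3 * f3
    a3 = x1 + x2 + x3
    a5 = x1 * x2 + x1 * x3 + x2 * x3
    a7 = x1 * x2 * x3
    # With a5 and a7 fixed (an assumption), a double root x satisfies
    # x**3 - a5*x + 2*a7 = 0; solve this depressed cubic trigonometrically.
    p, q = -a5, 2.0 * a7
    arg = max(-1.0, min(1.0, (3.0 * q / (2.0 * p)) * math.sqrt(-3.0 / p)))
    roots = sorted(2.0 * math.sqrt(-p / 3.0)
                   * math.cos(math.acos(arg) / 3.0 - 2.0 * math.pi * k / 3.0)
                   for k in range(3))
    x_low, x_high = roots[1], roots[2]  # the two positive roots

    def a3_at(x):
        # Value of a3 that places a double root at x.
        return x + a5 / x - a7 / (x * x)

    a3_max = a3_at(x_low)   # corner CR1: formants 1 and 2 coincide
    a3_min = a3_at(x_high)  # corner CR2: formants 2 and 3 coincide
    return a3 / a3_max, a3_min / a3


# Hypothetical formants in kHz: a front vowel and a back vowel.
front = corner_ratios(0.27, 2.29, 3.01)
back = corner_ratios(0.30, 0.87, 2.24)
print(front, back)
```

With these illustrative values the front vowel gives a Corner Ratio 2 close to 1 and a small Corner Ratio 1, while the back vowel gives a noticeably larger Corner Ratio 1, in line with the expectation stated above.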

Initial studies indicated that, from frame to frame,
Corner Ratio 1 was a smooth and well behaved curve during the
vowel portion of a word and could thus serve to segment the
vowel interval. Moreover, within the vowel interval, the
curve generally had "flat" spots. A computer program was
written, called Algorithm 1, to locate the maximum such "flat"
spot so that its utility as a frame selector could be
evaluated. For some speakers, as illustrated by Figure 15,
the sensitivity elements S(3,1) and S(3,3) of the selected
frame may be used to accurately identify each of the six
vowels.
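
The report does not give the details of Algorithm 1, so the following is only a hedged sketch of one possible reading: a frame is treated as "flat" when it differs from both neighbouring frames by less than a tolerance, and the flat frame with the largest Corner Ratio 1 value is selected. The function name, the tolerance, and the sample curve are all hypothetical.

```python
def select_flat_frame(curve, tolerance):
    """Index of the largest-valued frame whose two neighbours differ from it
    by less than `tolerance`; None if no frame qualifies."""
    best = None
    for i in range(1, len(curve) - 1):
        is_flat = (abs(curve[i] - curve[i - 1]) < tolerance
                   and abs(curve[i + 1] - curve[i]) < tolerance)
        if is_flat and (best is None or curve[i] > curve[best]):
            best = i
    return best


# Hypothetical Corner Ratio 1 values, one per frame.
corner_ratio_1 = [0.10, 0.25, 0.40, 0.41, 0.41, 0.30, 0.20, 0.20, 0.20, 0.05]
print(select_flat_frame(corner_ratio_1, 0.02))
```

A longer flatness window or a smoothed curve could be substituted without changing the idea; the point is that the selected frame lies where the ratio is both high and locally stationary.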

Each vowel /IY, EH, AE, AH, AO, UW/ is indicated by a
different symbol, as shown in the legends, and there is no
overlap among these six vowel groups. For other speakers, the
sensitivity values of the frames selected by Algorithm 1 may
overlap for neighboring vowels such as /EH, AE/. In such
cases, the corresponding vowel may only be identified as one
of a neighboring pair of vowels. This result suggested a
two-level identification scheme, where the second level was
based on vowel-pair-specific algorithms that made use of both
Corner Ratios. Two such algorithms, called Algorithm 2, were
developed: one to distinguish /EH/ from /AE/ and one to
distinguish /AO/ from /AH/. These algorithms are not speaker
specific, nor are any of the other algorithms.

For  the  neighboring  front  vowels  /EH,  AE/,  Algorithm  2 
selects  the  frame  at  the  maximum  "flat"  or  "smooth"  region  of 
the  Corner  Ratio  2  curve  that  occurs  prior  to  the  maximum 
value  in  the  Corner  Ratio  1  curve.  Thus,  at  the  selected 
frame,  formants  2  and  3  are  close.  In  the  case  of  neighboring 
back vowels /AH, AO/, Algorithm 2 selects the frame (1) at a
minimum of the Corner Ratio 2 curve, (2) at a maximum "flat"
spot if the Corner Ratio 2 curve trends upward, or (3) at a
minimum "flat" spot if the Corner Ratio 2 curve trends
downward. Thus, at the selected frame, formants 2 and 3
either  have  a  maximum  spread  or  have  a  spread  and  stationary 
relationship. 

Algorithm  2  has  been  applied  "by  hand"  to  obtain  the 
results  presented  in  this  report.  For  use  in  future  studies, 
a  computer  program  is  being  written  that  will  implement 
Algorithm  2.  Other  developments  needed  in  the  area  of  signal 
processing  are  outlined  in  the  section  titled  Future  Work. 


Results.

The five speakers used in this study were chosen to
represent a variety of American English dialects as defined by
Thomas [28]. Speaker 1, from rural Vermont, represented the
Eastern New England area. Speaker 2, originally from Detroit,
represented the North Central region. Speaker 3, from Rhode
Island, represented the New York area. Speaker 4, from
coastal Virginia, represented the Mid Atlantic area. Speaker
5, from Pittsburgh, represented the Western Pennsylvania area.
Table 2 shows the transcription most commonly used to describe
each speaker's production of general dialect variants of each
vowel as judged "live" by two expert phoneticians.


TABLE 2. GENERAL AMERICAN ENGLISH DIALECT PRODUCTION AND
ASSOCIATED SPEAKER PRODUCTS

            GENERAL AMERICAN DIALECT (G.A.D.)
SPEAKER     /i/     /ɛ/     /æ/     /ʌ/     /u/     /ɔ/

1           i       ɛ       æ       ʌ       u       a*
2           i       ɛ       æ       ʌ       u       ɔ
3           i       ɛ       æ       ʌ       u       ɒ*
4           i       ɛ*      æ*      ʌ*      u       ɔ
5           i       ɛ       æ       ʌ       u       ɔ

*Productions which differ from G.A.D.

Speakers 2 and 5 consistently maintained G.A.D. pronunciation
of all vowels. Speaker 1 used a consistent /a/ or /ɑ/
production for /ɔ/. Speaker 3 used a consistent /ɒ/
production for /ɔ/. Speaker 4 demonstrated a widespread
tendency to diphthongize all productions, particularly
productions of /ɛ/, /æ/, and /ʌ/.

The results described below were obtained from the
analysis of one of the three repetitions of each test word
recorded on each of the four days. The data from the
remaining repetitions remain available for future studies.
As described in the Methods section, each test word was
divided into successive frames. For each frame, the
coefficients of an 18-pole model were calculated using the
covariance method. A peak-picking algorithm located the
first 3 formant frequencies from the corresponding signal
spectrum. These were used, under the hypothesis of a lossless
model, to calculate Corner Ratio 1, Corner Ratio 2, and the
sensitivity elements S(3,1) and S(3,3) for each frame.

Using  data  from  all  six  vowels  of  Subject  2,  Figures  16 
through  23  show  example  plots  of  these  four  quantities  versus 
Sequence  Number,  which  is  the  speech  sample  number.  For  ease 
of  comparison,  the  three  front  vowels  are  grouped  in  each  plot 
as  are  the  three  back  vowels.  In  each  of  these  curves,  the 
well  behaved  regions  correspond  to  the  vowel  portions  of  the 
words.  By  placing  limits  on  the  values  of  Corner  Ratio  1  and 
Corner  Ratio  2  and  on  their  smoothness,  the  vowel  intervals 
have  automatically  been  segmented  as  labeled  by  s,  for  start, 
and  e,  for  end. 

Figures  16  through  23  also  show,  as  labeled  by  1  and  2, 
the  frames  selected  by  Algorithms  1  and  2.  As  described  in 
the  Methods  section,  and  illustrated  in  Figures  16  and  18, 
Algorithm  1  located  the  frame  at  the  maximum  "flat"  or 
"smooth"  region  of  the  Corner  Ratio  1  curves.  If  the 
sensitivity  values  of  the  selected  frame  indicate  that  the 
vowel  can  only  be  identified  as  one  of  a  neighboring  pair  of 
vowels,  then  a  vowel-pair  specific  form  of  Algorithm  2  may  be 
utilized. For the EH or AE vowels, Algorithm 2 located the
frames at the maximum "flat" or "smooth" region of the Corner
Ratio 2 curves, as illustrated in Figure 17. For the
contrasting case of the AH or AO vowels, Algorithm 2 located
the frame at the minimum of the Corner Ratio 2 curves, as
illustrated in Figure 19.

For each test word of each speaker, the sensitivity
elements S(3,1) and S(3,3) of the frame selected by Algorithm
2 are plotted in Figures 24 through 28. Each vowel is
indicated by a different symbol, and in no case is there an
overlap among elements of the six vowel groups. These results
indicate that it should be possible to accurately identify
each of the phonemes.

Standard statistical methods have been used to analyze
these data. The mean and standard deviation of the
sensitivity elements S(3,1) and S(3,3) have been calculated
for each vowel and for each consonant-vowel combination and
are tabulated in Tables 3 through 8. Part of the variation
within each vowel group was thought to reflect coarticulation
effects. In order to detect and characterize any
coarticulation, a multivariant analysis of variance was
calculated for each consonant-vowel combination, and the
corresponding P Value is also listed in Tables 3 through 8.
In Figures 24, 25, and 26, several groups of vowels with
the same initial consonant are labeled to provide a graphical
illustration of this coarticulation effect. Also, a summary of
the significant consonant-vowel coarticulation is given in
Table 9. A systematic study of such coarticulation effects is
proposed in the section titled Future Work.

In order to assess the statistical differences between
neighboring vowel pairs /EH, AE/, /AE, AH/, and /AH, AO/, a
multivariant analysis of variance was calculated for each
subject. The resulting P Values are listed in Table 10. In
each case, at least one of the two sensitivity elements has a
corresponding P Value of .0000. This result is consistent
with the nonoverlapping nature of the sensitivity elements
that comprise each vowel group. A summary of the minimum
per-cent Euclidean separation between these vowel groups, in
the S(3,1) vs. S(3,3) space, is shown in Table 11 and
provides another description of their nonoverlapping nature.
TABLE 3. MEAN AND STANDARD DEVIATION FOR VOWEL AND CONSONANT VOWEL COMBINATIONS
AND MULTIVARIANT ANALYSIS OF VARIANCE FOR CONSONANT VOWEL COMBINATIONS

          SENSITIVITY   MEAN (STANDARD DEVIATION)
SUBJECT   ELEMENT       IY        BIY       DIY       GIY       P VALUE

1         S(3,1)        .0203     .0211     .0193     .0206     >.10
                        (.0045)   (.0046)   (.0059)   (.0040)
          S(3,3)        2.08      2.21      2.03      2.01      >.10
                        (.28)     (.40)     (.26)     (.12)

2         S(3,1)        .0210     .0225     .0217     .0189     >.10
                        (.0051)   (.0046)   (.0067)   (.0043)
          S(3,3)        2.30      2.47      2.41      2.01      >.10
                        (.34)     (.22)     (.20)     (.41)

3         S(3,1)        .0156     .0185     .0192     .0090     .0467
                        (.0069)   (.0041)   (.0083)   (.0017)
          S(3,3)        2.19      2.27      2.23      2.06      >.10
                        (.62)     (.57)     (.92)     (.44)

4         S(3,1)        .0169     .0217     .0187     .0105     .0499
                        (.0071)   (.0063)   (.0073)   (.0018)
          S(3,3)        2.49      2.31      2.48      2.68      >.10
                        (.38)     (.28)     (.33)     (.51)

5         S(3,1)        .0134     .0145     .0107     .0151     >.10
                        (.0038)   (.0045)   (.0024)   (.0032)
          S(3,3)        2.60      2.52      2.58      2.68      >.10
                        (.28)     (.27)     (.11)     (.44)
TABLE 4. MEAN AND STANDARD DEVIATION FOR VOWEL AND CONSONANT VOWEL COMBINATIONS
AND MULTIVARIANT ANALYSIS OF VARIANCE FOR CONSONANT VOWEL COMBINATIONS

          SENSITIVITY   MEAN (STANDARD DEVIATION)
SUBJECT   ELEMENT       EH        BEH       DEH       GEH       P VALUE

1         S(3,1)        .0814     .0828     .0764     .0852     >.10
                        (.0115)   (.0156)   (.0134)   (.0026)
          S(3,3)        1.97      1.91      1.96      2.06      >.10
                        (.12)     (.11)     (.08)     (.13)

2         S(3,1)        .0871     .0991     .0818     .0803     .0021
                        (.0103)   (.0049)   (.0035)   (.0079)
          S(3,3)        1.48      1.34      1.54      1.57      .0118
                        (.13)     (.05)     (.06)     (.14)

3         S(3,1)        .0572     .0649     .0582     .0484     .0003
                        (.0078)   (.0027)   (.0044)   (.0033)
          S(3,3)        1.70      1.53      1.67      1.90      .0241
                        (.21)     (.03)     (.04)     (.27)

4         S(3,1)        .0575     .0567     .0457     .0700     .0200
                        (.0137)   (.0125)   (.0088)   (.0074)
          S(3,3)        2.47      2.24      2.66      2.50      >.10
                        (.31)     (.31)     (.27)     (.25)

5         S(3,1)        .0933     .0913     .0809     .1078     .0388
                        (.0161)   (.0103)   (.0059)   (.0180)
          S(3,3)        1.66      1.62      1.62      1.75      .0827
                        (.10)     (.04)     (.05)     (.12)
TABLE 5. MEAN AND STANDARD DEVIATION FOR VOWEL AND CONSONANT VOWEL COMBINATIONS
AND  MULTIVARIANT  ANALYSIS  OF  VARIANCE  FOR  CONSONANT  VOWEL  COMBINATIONS 


SENSITIVITY 


S(3,l) 

S(3,3) 


S(3,l) 
S  ( 3 , 3) 


S(3,l) 

S(3,3) 


MEAN  (STANDARD  DEVIATION) 


SUBJECT 

ELEMENT 

AE 

BAE 

DAE 

GAE 

P  VALUE 

1 

S(3,l) 

.1670 

.1705 

.1598 

.1708 

>.10 

(.0254) 

(.0351) 

(.0149) 

(.0285) 

S(3,3) 

1.74 

1.84 

1.66 

1.71 

>.10 

(.18) 

(.09) 

(.28) 

(.12) 

-1283 

(.0082) 

1.58 

(.14) 


.1218 

(.0160) 

1.52 

(.08) 


.0997 

(.0168) 

2.16 

(.33) 


.133' 
(.004' 
1.50 
(.08) 


.1330 

(.0131) 

1.51 

(.09) 


.1051 

(.0169) 

1.98 

(.22) 


.1233 

.0120) 

.63 

.21) 


.1197 

(.0166) 

1.48 

(.01) 


.0899 

(.0060) 

2.04 

(.10) 


.1282 

(.0039) 

1.62 

(.09) 

.1127 

(.0145) 

1.59 

(.10) 

.1042 

(.0227) 

2.45 

(.40) 


5 

S(3,l) 

.1518 

(.0121) 

S(3,3) 

1.70 

(.13) 

.0729 


.1559 

>.io  ! 

j 

.20) 

(.0176) 

> 

1.78 

>.10 

’) 

(.14) 
TABLE 6. MEAN AND STANDARD DEVIATION FOR VOWEL AND CONSONANT VOWEL COMBINATIONS
AND  MULTIVARIANT  ANALYSIS  OF  VARIANCE  FOR  CONSONANT  VOWEL  COMBINATIONS 


SENSITIVITY 
SUBJECT  ELEMENT 


MEAN  (STANDARD  DEVIATION) 


.112 
(.038 
.730 


.0136) 


.0 


(.0088) 


.0519 

.0664 

.0088) 

(.0034) 

..26 

.943 

• 

.15) 

(.044) 

0490 

.0756 

.0350 

.0 

0204) 

(.0090) 

(.0043) 

(.01 

841 

.693 

1.02 

.81 

161) 

(.037) 

(.11) 

(.O’ 

.0472 

.0548 

.0407 

.0461 

(.0083) 

(.0095) 

(.0029) 

(.0042 

1.47 

1.21 

1.68 

1.51 

(.25) 

(.12) 

(.17) 

(.19) 

P  VALUE 


71 

.0039 

69) 

.0004 

) 

'9 

.0387 

.0512 

.0002  j 

r 6 ) 

(.0029) 

(.0059) 

r 

1.158 

.984 

.0006 

» 

(.035) 

(.018) 



TABLE  7.  MEAN  AND  STANDARD  DEVIATION  FOR  VOWEL  AND  CONSONANT  VOWEL  COMBINATIONS 
AND  MULTIVARIANT  ANALYSIS  OF  VARIANCE  FOR  CONSONANT  VOWEL  COMBINATIONS 


SENSITIVITY 


(STANDARD  DEVIATION) 


3 

S(3,l) 

.132 

.132 

• 

(■ 

(.029) 

(.037) 

(. 

S(3,3) 

1.05 

.973 

1. 

i 

(.10) 

(.096) 

(. 

4 

S(3,l) 

•  . 

(.' 

S(3,3) 

l.: 

(.: 

.64 
21) 
1.11 
(.04) 


i 


.126 

(.002) 

1.23 

(.06) 


.156 

(.017) 

1.05 

(.10) 


.192 

(.041) 

1.30 

(.08) 


>  5 

S(3,l) 

.187 

• 

| 

(.037) 

(. 

i 

S(3,3) 

1.12 

• 

1 

(.12) 

(. 

.0675 

.0512 


.0219 

.0059 


TABLE 8. MEAN AND STANDARD DEVIATION FOR VOWEL AND CONSONANT VOWEL COMBINATIONS
AND MULTIVARIANT ANALYSIS OF VARIANCE FOR CONSONANT VOWEL COMBINATIONS

          SENSITIVITY   MEAN (STANDARD DEVIATION)
SUBJECT   ELEMENT       AO        BAO       DAO       GAO       P VALUE

1         S(3,1)        .454      .500      .428      .434      .0907
                        (.053)    (.025)    (.064)    (.035)
          S(3,3)        .958      .844      1.015     1.014     .0047
                        (.100)    (.031)    (.075)    (.068)

2         S(3,1)        .322      .342      .313      .309      >.10
                        (.026)    (.023)    (.026)    (.020)
          S(3,3)        .755      .698      .782      .785      .0733
                        (.063)    (.020)    (.070)    (.055)

3         S(3,1)        .168      .173      .174      .158      >.10
                        (.020)    (.020)    (.029)    (.007)
          S(3,3)        .708      .652      .708      .765      .0478
                        (.069)    (.028)    (.075)    (.050)

4         S(3,1)        .431      .447      .441      .406      >.10
                        (.069)    (.096)    (.049)    (.068)
          S(3,3)        .865      .809      .880      .906      .0181
                        (.055)    (.036)    (.027)    (.051)

5         S(3,1)        .433      .428      .495      .377      >.10
                        (.086)    (.078)    (.104)    (.029)
          S(3,3)        .861      .828      .862      .892      >.10
                        (.070)    (.110)    (.035)    (.045)

TABLE  9.  CONSONANT  VOWEL  COARTICULATION 


SENSITIVITY  NUMBER  OF  SUBJECTS  WITH  P  VALUE  <.10 

SUBJECT  ELEMENT  IY  EH  AE  AH  AO  UW 


All  five 

S(3,l) 

2  4 

0  5 

5 

S(3,3) 

0  3 

1  5 

5 

TABLE 10. MULTIVARIANT ANALYSIS OF VARIANCE FOR VOWEL PAIRS

          SENSITIVITY   P VALUE FOR VOWEL PAIR
SUBJECT   ELEMENT       EH-AE     AE-AH     AH-AO

1         S(3,1)        .0000     .0102     .0000
          S(3,3)        .0011     .0000     .0000

2         S(3,1)        .0000     .0000     .0000
          S(3,3)        .0000     .0000     .0000

3         S(3,1)        .0000     >.10      .0010
          S(3,3)        .0028     .0000     .0000

4         S(3,1)        .0000     .0000     .0000
          S(3,3)        .0126     .0000     .0000

5         S(3,1)        .0000     .0001     .0000
          S(3,3)        >.10      .0000     .0000

TABLE 11. MINIMUM PER-CENT EUCLIDEAN SEPARATION, IN SENSITIVITY SPACE,
BETWEEN VOWEL GROUPS

MINIMUM  PER-CENT  EUCLIDEAN  SEPARATION 
EH-AE  AE-AH  AH-AO 


SUBJECT 


Discussion. 

This  research  program  was  focused  on  a  selected  set  of 
six  vowels  contained  in  single  words  spoken  in  a  simple 
carrier  phrase  by  five  males  with  differing  dialects.  The 
objectives  were  to  evaluate  the  sensitivity  matrix,  interpret 
its  changes  during  the  production  of  the  vowels,  and  to 
evaluate  inter-speaker  variations. 

As described in the Methods section, each test word was
divided into successive frames. For each frame, the
covariance method was used to determine the coefficients of an
18-pole speech model. From the corresponding signal spectrum,
the first three formant frequencies were located using a
peak-picking algorithm. These general purpose signal
processing methods need further study and development. In
particular, there are occasional frames within the vowel
portion of a word where the signal processing results seem
inexplicable. An example of such a "glitch" is labeled gl in
Figure 23.

Under the hypothesis of a lossless model, these first
three formant frequencies were used to calculate Corner Ratio
1, Corner Ratio 2, and the sensitivity elements S(3,1) and
S(3,3) for each frame. As defined by Equations 11 and 12,
Corner Ratio 1 is large when formants 1 and 2 are close,
whereas Corner Ratio 2 is large when formants 2 and 3 are
close. Comparing Corner Ratio 1 curves for the three front
vowels in Figure 16 shows the changes that occur during the
vowel portion of a word and shows the progressive formant 1,
formant 2 shifts that occur among the high /IY/ to low /AE/
front vowels. Figure 18 shows similar Corner Ratio 1 results
for the group of back vowels. Comparing Corner Ratio 2 curves
for the three front vowels in Figure 17 shows they are
generally large, rise to a maximum, and then slowly fall. As
a contrast, Figure 19 shows the corresponding Corner Ratio 2
curves for the back vowels. For this subject, there is a
clear minimum with progressive shift from the high /UW/ to low
/AO/ back vowels. The general character of these curves and
their differences across vowel groups have proved to be
important.

It  was  found  that  the  vowel  interval  could  be  segmented 
by  placing  limits  on  the  values  of  Corner  Ratio  1  and  Corner 
Ratio  2  and  on  their  smoothness.  It  was  also  found  that 
Corner  Ratio  1  and  Corner  Ratio  2  could  be  used  to  determine 
when a vowel was "on target" by describing the location of the
formants relative to the root-locus corners CR1 and CR2 shown
in Figure 14. Algorithm 1 selected a "target" frame based on
the  behavior  of  Corner  Ratio  1.  For  some  speakers,  the 
sensitivity  elements  of  the  selected  frame  may  be  used  to 
accurately  identify  each  vowel.  However,  nearest  neighbor 
ambiguity  may  arise  if  a  speaker's  dialect  does  not  clearly 
distinguish  between  phonemes,  if  a  speaker  has  a  greater 
tendency  to  coarticulate,  or  if  a  speaker  tends  to 
diphthongize  speech  productions. 

The  results  presented  in  Figures  24  through  28  were 
obtained  by  a  two-level  identification  scheme  where  the  second 
level  used  a  vowel-pair-specific  form  of  Algorithm  2  that 
examined  the  behavior  of  Corner  Ratio  2.  For  each  of  the 
speakers,  all  6  vowel  groups  are  nonoverlapping.  Tables  10 
and  11  demonstrate  this  separation  in  statistical  and 
geometric  terms.  The  minimum  per-cent  Euclidean  separation 
varies  across  speakers  as  well  as  vowel-pairs.  In  the  case  of 
Subject  1,  the  9  per-cent  separation  between  AE  and  AH  is  due 
to  a  "single  wild  point."  For  speaker  5,  the  11  per-cent 
separation  between  EH  and  AE  appears  to  be  due  to 
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coarticulation  in  words  that  begin  with  /g/.  Subjects  3  and  4 
have  cases  of  9  per-cent  separation  that  appear  to  be  due  to 
their  individual  dialects.  The  EH-AE  separation  for  speaker 
4,  from  Virginia,  appears  to  be  reduced  because  of
coarticulation  and  a  tendency  to  diphthongize  all  productions.
Speaker  3,  from  Rhode  Island,  had  a  small  AH-AO  separation 
because  he  used  a  consistent  /£)/  production  for  /AO/. 
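
To make the nearest-neighbor ambiguity and the notion of group separation concrete, the following Python sketch classifies a target-frame point in the plane of two sensitivity elements and computes a percentage separation between two vowel groups. The separation measure used here (smallest inter-group distance as a percentage of the distance between group centroids) is an assumed definition for illustration and may differ from the one behind Tables 10 and 11; the second-level, vowel-pair-specific test of Algorithm 2 is not reproduced. Function names and the sample points are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def nearest_vowel(sample, reference):
    """Label of the reference group containing the point closest to sample."""
    return min(
        reference,
        key=lambda label: min(math.dist(sample, p) for p in reference[label]),
    )


def percent_separation(group_a, group_b):
    """Smallest inter-group distance as a percentage of centroid distance.

    Assumed definition; the report's own measure may differ.
    """
    gap = min(math.dist(p, q) for p in group_a for q in group_b)
    span = math.dist(centroid(group_a), centroid(group_b))
    return 100.0 * gap / span


# Invented training points, standing in for (S(3,1), S(3,3)) pairs.
reference = {
    "AE": [(0.0, 0.0), (1.0, 0.0)],
    "AH": [(4.0, 0.0), (5.0, 0.0)],
}
label = nearest_vowel((1.5, 0.0), reference)
separation = percent_separation(reference["AE"], reference["AH"])
```

Under this assumed measure, a single outlying production can shrink the smallest inter-group distance sharply while leaving the centroids almost unchanged, which is one way a low percentage could arise from one "wild point."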

The  results  presented  in  Figures  24  through  28  also 
provide  a  graphical  view  of  the  inter-speaker  variations  as 
well  as  the  inter-speaker  similarities.  The  differences  in 
relative  location  of  /AO/  for  speakers  1  and  3  are  consistent 
with  the  "live"  judgment  of  the  expert  phoneticians:  toward
/AH/  for  speaker  3  and  toward  /AA/  for  speaker  1.  These  results
reflect  "phonetic  facts  of  life"  and  indicate  that  accurate
phoneme  identification  will  require  a  "training"  set  of  data
for  each  speaker. 

In  summary,  the  specific  objectives  have  been  met.  The 
sensitivity  matrix  was  evaluated  for  the  set  of  test  words  and 
speakers.  It  was  found  that  Corner  Ratio  1  and  Corner  Ratio  2 
can  be  used  to  (a)  segment  the  vowel  interval  and  (b)  locate 
when  a  vowel  is  "on  target."  It  was  also  found  that  the 
sensitivity  elements  S(3,1)  and  S(3,3)  of  the  "on  target"
frame  should  provide  sufficient  information  to  accurately 
identify  each  vowel  within  the  test  set. 
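
For reference, the relative sensitivity of a pole to a coefficient of the characteristic equation can be computed in closed form. The Python sketch below uses the standard first-order result for a monic polynomial P(z) = z^n + a_1 z^(n-1) + ... + a_n with distinct, nonzero roots: S(i,k) = (a_k / z_i)(dz_i / da_k) = -a_k z_i^(n-k-1) / P'(z_i). The indexing convention, the function names, and the example poles are assumptions and may not match the report's definition of S(3,1) and S(3,3).

```python
def coefficients_from_poles(poles):
    """Expand prod(z - p) into [1, a_1, ..., a_n], descending powers of z."""
    coeffs = [1 + 0j]
    for p in poles:
        # Multiply the current polynomial by (z - p).
        coeffs = [c - p * prev for c, prev in zip(coeffs + [0j], [0j] + coeffs)]
    return coeffs


def sensitivity_matrix(poles):
    """Relative root sensitivities S[i][k-1] = (a_k / z_i) * dz_i/da_k.

    Assumes the poles are distinct and nonzero. Row i belongs to poles[i];
    column k-1 belongs to coefficient a_k (assumed indexing).
    """
    n = len(poles)
    a = coefficients_from_poles(poles)  # a[0] is the leading 1
    matrix = []
    for i, z in enumerate(poles):
        # P'(z_i) for a monic polynomial with simple roots.
        derivative = 1 + 0j
        for j, other in enumerate(poles):
            if j != i:
                derivative *= z - other
        matrix.append(
            [-a[k] * z ** (n - k - 1) / derivative for k in range(1, n + 1)]
        )
    return matrix


# Invented example: two complex-conjugate pole pairs inside the unit circle.
poles = [0.6 + 0.7j, 0.6 - 0.7j, 0.5 + 0.2j, 0.5 - 0.2j]
S = sensitivity_matrix(poles)
```

A convenient consistency check follows from scaling: replacing each a_k by c^k a_k multiplies every root by c, so the weighted row sum of k times S(i,k) over k equals 1 for every pole.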

VI.  FUTURE  WORK: 

Any  particular  method  of  segmentation  and  identification
of  phonemes  should  be  challenged  by  speech  material  which
presents,  in  both  a  controlled  and  naturalistic  manner,  as 
many  factors  known  to  cause  acoustic-phonetic  variation  as 
possible.  A  realistic  expansion  of  the  set  of 
phonemes  would  include  the  unvoiced  stop  consonants  /p,  t,  k/
and  the  nasal  consonants  /m,  n/.  The  six  stop  consonants
share  the  same  manner  of  articulation  but  differ  in  voicing
and  place,  and  should  induce  substantial  coarticulatory  variations.

Likewise,  the  nasal  consonants  should  induce  substantial 
coarticulatory  variations  in  the  neighboring  vowels. 

The  segmentation  and  identification  studies  would  also  be
expanded  to  include  both  the  consonants  and  the  vowels.  With
this  broader  class  of  sounds,  it  is  anticipated  that  it  will
be  necessary  to  use  improved  spectral  estimation  techniques  to
obtain  appropriate  pole-zero  models  [29-32].  The  root
sensitivity  analysis  currently  used  for  the  formants  can  also
be  used  for  the  zeros  in  the  speech  model.

The  sequence  of  studies  could  be  described  as  follows. 
Task  I  should  investigate  improved  spectral  estimation 
techniques  for  pole-zero  models,  evaluate  the  corresponding 
sensitivity  matrix  and  the  changes  that  occur  during  the 
production  of  the  vowels  in  the  words,  test  the  conclusions 
reached  in  the  current  study,  and  evaluate  inter-speaker 
variations.  Task  II  should  evaluate  and  interpret  changes  in 
the  sensitivity  matrix  for  the  vowels  due  to  coarticulation. 
Task  III  should  evaluate  and  interpret  the  sensitivity  matrix 
for  the  differing  initial  stop  and  nasal  consonants.  Finally, 
Task  IV  should  use  the  above  results  to  build  a  reference 
library,  and  should  evaluate  the  efficacy  of  the  sensitivity 
matrix  in  terms  of  the  accuracy  of  the  resulting  phonetic 
representation  of  unknown  speech. 

Based  on  the  results  described  in  this  report,  a  paper
will  be  submitted  for  publication  in  the  Journal  of  the
Acoustical  Society  of  America.  The  proposed  authors  and  title
are:  Richard  G.  Absher  and  Thomas  Saunders,  "Sensitivity
Based  Segmentation  and  Identification  of  Vowels  in  Continuous
Speech."


